ABSTRACT

Direct assessment of writing skill, usually considered to be
synonymous with assessment by means of writing samples,
is reviewed in terms of its history and with respect to evidence
of its reliability and validity. Reliability is examined as
it is influenced by reader inconsistency, domain sampling,
and other sources of error. Validity evidence is presented,
which shows reported relationships between direct assessment
scores and criteria such as class rank, English course
grades, and instructors' ratings of writing ability. Evidence
on the incremental validity of direct assessment over and
above other available measures is also given. It is concluded
that direct assessment makes a contribution but that methods
need to be developed to improve its reliability and reduce its
costs. New automated methods of textual analysis and new
kinds of direct assessment in which more than a single score
is produced are suggested as two approaches to better direct
assessment.



INTRODUCTION 

Over the years writing skill has been appraised through two
approaches, direct assessment and indirect assessment.
Direct assessments are those in which a sample of an examinee's
writing is obtained under controlled conditions and
then evaluated by one or more judges, usually English teachers
trained in making judgments about writing skill. Indirect
assessments are so termed because an estimate of probable
skill in writing is made through observations of specific kinds
of knowledge about writing, such as grammar and sentence
structure, although more advanced skills can also be
observed. These indirect assessments are commonly made
by means of multiple-choice questions. Thus, direct assessments
tend to be associated with writing samples and indirect
assessments with multiple-choice questions. Later in this
review, the distinction between direct and indirect assessments
of writing skill will be reconsidered because the usual
distinction may be more simplistic than it needs to be. For the
moment, however, direct assessments will be thought of as
writing samples evaluated by one or more judges. Indirect
assessments will not be covered in this review.

Diederich (1974) probably captured better than anyone
else the reasoning behind the widespread use of writing
samples for the assessment of writing skill.

As a test of writing ability, no test is as convincing to teachers
of English, to teachers in other departments, to prospective
employers, and to the public as actual samples of each student's
writing, especially if the writing is done under test conditions
in which one can be sure that each sample is the student's own
unaided work. People who uphold the view that essays are the
only valid test of writing ability are fond of using the analogy
that, whenever we want to find out whether young people can
swim, we have them jump into a pool and swim. (p. 1)

From this perspective, if one wants to know if any given
individual can perform any given task, a test of performance
in that task is what is needed. Coffman (1971a) presented the
same kind of argument for the academic context:

The only way to assess the extent to which a student has
mastered a field is to present him with questions or problems in
the field and see how he performs. The scholar performs by
speaking or writing. The essay examination constitutes a sample
of scholarly performance; hence, it provides a direct measure
of educational achievement. (p. 273)

The logic of these kinds of arguments is so cogent that
despite more than half a century of criticism by educational
measurement specialists, the essay remains a principal
means of evaluation in courses of instruction of all types. In
recent years, in fact, the essay has gained more and more
advocates as evidence of a decline in writing skills among
high school and college students accrues with each day.
Faced with this, it is difficult to deny that students need more
exposure to writing, whether in the form of instruction or
examination.

Also related to direct assessment are issues of national
impact: the message that is implicitly sent to students and
teachers by direct assessment used on a wide scale. If large
numbers of students are required to produce compositions for
assessments important for graduation, certification, or
admission to higher levels of education, then students will be
encouraged to learn composition skills and teachers to teach
them.

Nonetheless, the history of direct writing skill assessment
is a bleak one. As far back as 1880 it was recognized
that the essay examination was beset with the curse of unreliability
(see Huddleston 1954; Follman and Anderson 1967).
One of the first demonstrations of the reliability problem
occurred in the 1920s when it was shown that the score a
student received on a College Board examination could
depend more on which reader read his or her paper, or on
when the examination was taken, than on what was actually
written (Hopkins 1921).

The reliability problem is perhaps best illustrated by a 
simple example. In 1961 a study was conducted at the Educa- 
tional Testing Service in which 300 essays written by college 
freshmen were rated by 53 readers representing several pro- 
fessional fields (French 1962). Each rater used a nine-point 
scale. The results showed that none of the 300 essays
received fewer than five of the nine possible ratings, 23 percent
of the essays received seven different ratings, 37 percent 
received eight different ratings, and 34 percent received all 
possible ratings. It was clear from this study that the score 
received was to a large degree dependent upon which expert 
happened to be doing the scoring. 

The severity of the reliability problem noted was accentuated
by the realization that readers represented only one
source of error. Perhaps greater errors in a direct assessment
are introduced by the limited sampling of topics on which
students can write. Furthermore, additional errors are introduced
by a tendency for errors to be correlated (because
readers are influenced similarly by extraneous factors such as
essay length, handwriting quality, and neatness) and by
interactions among the different sources of error. The sources
of reader error are many. A study by Shepard (1929) showed
dramatic variations in the scores received by identical essay
responses differing only in penmanship. Of course penmanship
is probably less important today, but there is some
evidence that it can still affect the score assigned to an essay
examination (Markham 1976). In another early study, Traxler
and Anderson (1935) showed that two independent scores
made by experienced readers of essay examinations agreed
fairly well for one essay topic but not for a second topic. It
was also observed that the grades assigned to essays tended to
be influenced by the grades given to the papers immediately
preceding. Stalnaker (1936) noted, in this regard, that

A "C" paper may be graded "B" if it is read after an illiterate
theme, but if it follows an "A" paper, if such can be found, it
seems to be of "D" caliber.

The overall impact of the reliability problem manifests
itself when one attempts to correlate judgmental scores of
essays with external criteria for purposes of validation. More
often than not, correlations of judgmental scores with other
measures are lower than would be expected, and this is
usually caused by the low reliability of the judgmental
scores.

Reliability will be revisited in a later section of this
review, but it is first of value to review some of the different
types of writing tasks commonly used in direct assessments
and the methods used for evaluating them, since the specific
task and evaluation procedure can influence reliability.



TYPES OF DIRECT ASSESSMENT 



This section is intended only as a brief summary of various
types of direct assessment as background for subsequent
discussions of reliability, validity, and other issues that at
times are influenced by the type of assessment. More
complete (and more precise) treatments of types of
assessment described here, as well as other types, are given
in a number of writings by members of the English teaching
profession (see, for example, Cooper 1977; Lloyd-Jones
1977; Myers 1980; Odell 1981). The types of direct
assessments commonly used may be classified with respect
to task types and the method of evaluation used. At times
the evaluation procedure is closely linked with the task, as
in primary-trait scoring. Most scoring methods, however,
can be applied to more than one specific task, though
modifications may be necessary as the tasks vary.

Task Types 

Task types are infinite in their variety, since they vary not
only with the topic to be addressed but with the specific kind
of prompt or stimulus used, the audience to be addressed,
and the purpose intended. Prompts may be written, aural, or
pictorial. The audience and purpose may be only implicit,
as when a student writes something to be evaluated by his or
her teacher or an anonymous teacher or group of teachers. A
task may allow consultation of reference works, such as
dictionaries, and time for revision, editing, and rewriting.
Or, it may be a brief, impromptu task, which allows no
consultation of reference works and no time for rewriting.
Following are brief descriptions of some well-known types
of writing tasks.

Letter 

An examinee might be asked to write a letter of some type: 
to a friend, to the editor of a newspaper, to a potential 
employer, to a company complaining about a product or 
service, and so on.

Narrative 

An autobiographical account, a description of a vacation or 
other experience, or a historical description of some other 
type would all be narratives. These narratives could, of 
course, also be written in the form of a letter, and narratives 
can be either real or imaginary.

Descriptive 

Although a narrative is usually descriptive, the term implies 
the description of a series of events. A piece of writing may 
be simply the description of some object, how it looks, how 
it works, or some other aspect of it, or some other kind of 
description. 

Argumentative 

In this type of task, the examinee is usually asked to take a
position on some issue and argue persuasively for that
position using evidence from his or her own personal
experience or reading. It is probably the most common task
type used because it requires the integration of several
different writing skills. Sometimes this type of task is
referred to as an "expository-argumentative" task.

Expressive 

Rather than argue persuasively, the task may be only to 
express one's opinion on some issue or event. While 
expository in nature, this kind of task is usually dis- 
tinguished from a persuasive or argumentative exposition. 

Role-Playing 

One may be asked to assume a role in some situation and 
then to write something (such as a letter or a memorandum) 
for some specific purpose. Examples would include 
responding to an irate customer as a customer relations 
official, or writing a memorandum to a superior or a 
subordinate in an organization. For role-playing tasks, the 
audience and purpose are usually quite clear. 






Precis or Abstract 

A real-life task of some importance is that of synthesizing a
large body of information for transmittal to an audience
different from that intended in the original piece. Scientists 
abstract complex scientific investigations for nonspecialists. 
Diplomats abstract current information about specific 
countries, at times originally written in other languages, for 
use by others. Lawyers synthesize case histories having 
legal precedents in making arguments. Therefore, a useful 
task is to ask students to read something and then to prepare 
a brief precis or abstract of it. 

Diary Entry 

This could be similar to any of the preceding tasks, but the 
fact that it is written for personal use would probably change 
its tone. 

Literary Analysis 

This is a common task used in literature courses and in the
more difficult English examinations.

Revision or Editing 

Any of the tasks above might be the subject of a task 
requiring revision or editing. 

Evaluation Methods 

Having obtained a response to one or more of the stimuli 
represented in the task types discussed in the preceding 
section, one can then usually choose among a number of 
different methods for evaluation of the response. As noted 
earlier, some evaluation methods are closely tied to the 
stimulus, namely, primary-trait methods. Thus, the task 
may predetermine the evaluation method. Among the 
several different approaches to evaluation, some are more 
widely used than others. The descriptions that follow, it 
should be cautioned, do not represent a consensus of
opinion on the meaning of terms. Rather, they are an attempt 
to describe briefly methods about which there is often much 
disagreement. 

Holistic Scoring 

According to Cooper (1977), in holistic scoring "the rater 
takes a piece of writing and either (1) matches it with 
another piece in a graded series of pieces, or (2) scores it for 
the prominence of certain features important to that kind of 
writing, or (3) assigns it a letter or number grade. The 
placing, scoring, or grading occurs quickly, impres- 
sionistically, after the rater has practiced the procedure with 
other raters." Holistic scoring is at times conducted using 
scoring guides, or rubrics. Some practitioners of holistic 
scoring distinguish it from impressionistic scoring, since the
latter is viewed as a haphazard, noncontrolled, and 
unmonitored procedure. Holistic scoring is the most widely 
used evaluation procedure. 



Focused Holistic Scoring 

This method is essentially the same as holistic scoring 
except that scores are produced for more than a single 
dimension of the writing sample being evaluated. For 
example, one might score for content and mechanics, or for 
some other specific aspects. The scoring might be done for
each dimension after a single reading, or it might be done
for each separately so as to minimize influences of one focus
on the other. The number of focuses must of course be
limited; otherwise, the procedure tends to be more like an
analytical procedure. As in holistic scoring, no counts or
enumerations of any type are used. Scoring rubrics for each
of the dimensions focused on, however, may be used. 

Analytic Scoring 

This evaluation procedure is perhaps best exemplified by
that associated with Diederich (1974). The Diederich
procedure is based on a factor analysis of writing samples
scored by experts representing several different academic
disciplines. The factors derived were ideas, organization,
wording, flavor, and mechanics. In some versions of the
method, mechanics is further divided into usage, punctuation,
spelling, and handwriting. Each factor is rated on a
scale from 5 (high) to 1 (low), and two of the scales (ideas
and organization) receive a double weighting. Thus it is
possible to obtain a score as high as 50, or as low as 10.
Other analytic procedures are described by Cooper (1977),
Odell (1981), and Follman and Anderson (1967).
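
The arithmetic behind the 10-to-50 range can be sketched as follows. This is only an illustration, not Diederich's own procedure: it assumes the eight-scale version (mechanics split into usage, punctuation, spelling, and handwriting), which is the version consistent with the range stated above, and the function and variable names are hypothetical.

```python
# Illustrative sketch of a Diederich-style weighted total.
# Assumption: the eight-scale version, with ideas and organization doubled.
DOUBLE_WEIGHTED = ("ideas", "organization")
SINGLE_WEIGHTED = ("wording", "flavor", "usage",
                   "punctuation", "spelling", "handwriting")


def diederich_total(ratings):
    """Return the weighted total for a dict mapping scale name to a 1-5 rating."""
    total = 0
    for scale in DOUBLE_WEIGHTED + SINGLE_WEIGHTED:
        rating = ratings[scale]
        if not 1 <= rating <= 5:
            raise ValueError(f"{scale} rating must be between 1 and 5")
        weight = 2 if scale in DOUBLE_WEIGHTED else 1
        total += weight * rating
    return total


if __name__ == "__main__":
    all_scales = DOUBLE_WEIGHTED + SINGLE_WEIGHTED
    print(diederich_total({s: 5 for s in all_scales}))  # highest possible total
    print(diederich_total({s: 1 for s in all_scales}))  # lowest possible total
```

With two scales counted twice and six counted once, the weights sum to 10, so ratings of 1 to 5 on every scale give totals from 10 to 50.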

Atomistic Scoring 

Somewhat akin to analytic scoring are methods in which
detailed enumerations are made of quite a number of
different features of a piece of writing. While certainly
"analytic" in many senses, it is useful to distinguish
atomistic scoring from analytic scoring, as described here,
because it is very different with respect to the detail
required. One example of an atomistic scoring procedure
was described by Moss (1982). In this procedure, the total
number of errors was counted in each of four categories:
spelling, capitalization, punctuation, and expression. To
develop a score from these counts, the total number of
errors was divided by an index of paper length so as to avoid
inappropriate penalties for writing more.
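
A length-adjusted error score of this kind might be computed as in the sketch below. The text does not say what index of paper length was used, so the choice of "hundreds of words" here is an assumption made purely for illustration, as are the function and parameter names.

```python
# Illustrative sketch of a length-adjusted atomistic error score.
# Assumption: paper length is indexed in units of 100 words; the source
# does not specify the index actually used.
CATEGORIES = ("spelling", "capitalization", "punctuation", "expression")


def atomistic_error_rate(error_counts, word_count, words_per_unit=100):
    """Total errors across the four categories per unit of paper length."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    total_errors = sum(error_counts[category] for category in CATEGORIES)
    length_index = word_count / words_per_unit
    return total_errors / length_index


if __name__ == "__main__":
    short_paper = {"spelling": 3, "capitalization": 1,
                   "punctuation": 2, "expression": 4}
    long_paper = {category: 2 * n for category, n in short_paper.items()}
    # Twice the errors in twice the words yields the same rate.
    print(atomistic_error_rate(short_paper, word_count=200))
    print(atomistic_error_rate(long_paper, word_count=400))
```

Dividing by length means a writer who produces more text at the same error density is not penalized for the larger raw count.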

Primary-Trait Scoring 

Mullis (1980) explains that the rationale of primary-trait
scoring "is that writing is done in terms of an audience and
can be judged in view of its effects upon the audience." The
primary, or most important, trait of a piece of writing will
be the approach used by the writer to reach the audience
intended. The primary trait of a set of directions, for
example, "would be an unambiguous, sequential, and
logical progression of instructions," according to Mullis.
Another example given by Mullis is a piece of political
campaign literature intended to persuade a reader to vote for
a particular candidate. "A successful campaign paper will
have certain persuasive traits that an unsuccessful one will
not have, and these traits will differ from those necessary
for a successful set of directions," Mullis notes. For any
given task, the scoring directions must be prepared
beforehand, and they are usable only with that specific task.

Syntactic Scoring 

Hunt (1977) has popularized a method of gauging syntactic
maturity which is most often associated with the term
"T-unit." A T-unit is defined by Hunt as a "single main clause
plus whatever other subordinate clauses or nonclauses are
attached to, or embedded within, that one main clause." In
other words, a T-unit is a single main clause and whatever
else goes with it. The T-unit is used, rather than the
sentence, because it is empirically useful in describing the
changes that occur in the syntax of writers as they mature.

Communicative Effectiveness 

In a sense similar in objectives to primary-trait scoring, this 
method of measuring the quality of prose is also concerned 
with the effects it has on an audience. But, operationally, 
the method is very different from primary-trait scoring. 
Hirsch and Harrington (1981) describe the theoretical basis 
for this new method and some of its advantages over 
traditional methods of scoring. The method is also similar in 
some ways to recent approaches being taken by cognitive
psychologists, in which the theory and structures of reading
comprehension research are applied to the analysis of text
(see, for example, Bracewell et al. 1982; Bruce et al. 1982;
Fredericksen 1983). Usually, an objective index of communicative
effectiveness, such as reading speed or comprehension,
is derived for the assessment.

Automated Scoring 

Another new method of evaluation that is of considerable 
interest is that done by computer. Frase et al. (1981) and 
MacDonald et al. (1982) describe a computer-based system
developed at Bell Laboratories that is presently operational.
A more sophisticated parsing system is under development
by IBM (see Heidorn et al. 1982). These methods will be
discussed in more detail in a later section of this review.

RELIABILITY OF DIRECT ASSESSMENTS

Numerous research investigations have demonstrated that
direct assessments of writing skill, as usually conducted,
tend to yield low reliabilities. The sources of error are
several, but most analyses have focused on two primary
sources: rater inconsistency and sampling bias. Rater
inconsistency occurs not only between raters but with the
same rater from one occasion to the next, even when the
same writing sample is being scored. Rater variability
consists of three different components (Coffman 1971a).
First, raters differ with respect to leniency. Some may tend
to score high and others low; thus, the level of score
obtained by any individual examinee depends upon the rater
or raters assigned to score the responses of that examinee.
Second, raters differ in the degree to which they have a
central tendency, an inclination to score near the average.
Third, different raters have different values that many times
lead them to assign grossly different scores to the same
response.

While less research has been concerned with the 
problem of sampling error, it is probable that sampling is 
also a serious source of error in direct writing assessments. 
A highly reliable writing assessment will require more than 
one writing sample, and each sample will be independent 
from all other samples. Such independence does not occur, 
of course, when several tasks are required that relate to the 
same topic stimulus. The most reliable assessment will 
occur when all of the responses are scored independently by 
different raters. The greater the number of independent
responses and the greater the number of independent ratings
of each response, the greater will be the reliability of the
assessment. Unfortunately, it has not proved to be
economically feasible to conduct large-scale writing assessments
using multiple writing samples and multiple independent
ratings. For the same reason, there have been few
research investigations of multiple samples scored independently
by multiple readers. Table 1 presents a summary of
24 research studies in which reliability estimates were 
reported for direct assessments of writing skill. These 
studies are summarized with respect to a number of factors 
that may have influenced the magnitude of the estimates 
reported. A consideration of these factors is useful as an 
introduction to the reliability estimation for direct assess- 
ments. 
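
The statement that reliability rises with the number of independent responses and ratings can be illustrated with the classical Spearman-Brown formula. The review does not name this formula at this point; it is used here as an assumption, and it holds strictly only when the added ratings are independent and equally reliable. The single-rating value in the example is hypothetical.

```python
# Illustrative sketch: projected reliability of an average of k ratings,
# assuming independent, equally reliable (parallel) ratings.
def spearman_brown(single_reliability, k):
    """Reliability of the mean of k parallel ratings (Spearman-Brown)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * single_reliability / (1 + (k - 1) * single_reliability)


if __name__ == "__main__":
    r_single = 0.50  # hypothetical reliability of one rating
    for k in range(1, 6):
        print(k, round(spearman_brown(r_single, k), 2))
```

The gain from each added rating shrinks as ratings accumulate, which is one reason the cost of multiple readings weighs so heavily in large-scale assessments.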

Factors Influencing Reliability Estimates 

Table 1 is limited to studies reporting reliability estimates 
for direct assessments of junior high, high school, and 
college populations. However, quite a variety of social, 
ethnic, and ability groups is represented. The population 
sampled can influence reliability estimates if it is restricted 
in range of ability, but how such influences operate is not 
always clear. It is usually assumed that restrictions in range 
will attenuate estimates, but the actual effects are dependent 
on other aspects of the population distributions as well. The 
number of cases used for the estimate affects its stability:
the larger the number of cases, the more stable will be the
estimate. Reliability is also influenced by the type of writing
tasks used and the amount of time allowed for response, but
little evidence is available concerning the effects of task
type and timing on reliability. The most common type of
writing sample is the brief persuasive or argumentative
essay in which some position is to be taken on an issue
presented and a thesis developed to support that position
using examples, facts, or other evidence. The second most
common type of sample is a narrative essay (at times called
a descriptive or narrative-descriptive essay). Any of these
tasks can require the writer to consider a specific audience
or purpose. More commonly, however, neither the audience
nor the purpose is specified. Very few assessments of this
type offer the examinee a choice of topics. The time allowed
for the writing tasks in Table 1 varied from 20 minutes to 2
hours, with 20 minutes being the most common.

Table 1. Studies Reporting Reliability Estimates for Direct Assessments
[Only the final rows survive in this copy.]

No.  Study                     Population            Sample    N            Task         Time (min)  Scoring    Scale
     [preceding entry]         [...] students        8th grade 135
21   Quellmalz et al. (1982)   High school students  None      [illegible]  Expository   Not given   Analytic   1-4
22   Steele (1979)             College freshmen      Sample 1  65           Letters      20          Analytic   1-5
                                                     Sample 2  50
23   Weiss and Jackson (1982)  College students      None      224          Descriptive  40          Atomistic  0-64
                                                               224          Persuasive   40          Holistic   1-6
                                                     Posttest  123          Persuasive   20          Holistic   1-6
24   Werts et al. (1980)       College students      None      234          Persuasive   20          Holistic   1-6

The method of evaluation of writing samples can
influence reliability. Essentially, three principal approaches
to scoring of writing samples are represented in Table 1.
Holistic scoring is the most common, but what is termed
holistic scoring may vary from one study to the next. The
other two types of scoring represented are analytic scoring
and atomistic scoring. The distinction between these two is
not always clear, but in this review analytic scoring refers to
the development of several subscores which are either
interpreted separately or combined to produce a total score.
Atomistic scoring refers to a very detailed count of errors or
a detailed scoring of many aspects of a sample. Any scoring
with as many as 20 subscores has been considered here to be
atomistic, even if the authors called it analytic.

Scoring scales differ somewhat, and these also can 
affect reliability. The most common scale has been the 1 
(low) to 4 (high) scale often used for holistic scoring. Some 
observers believe (for example, Coffman 1971a, 1971b) that 
a greater scale range produces better reliabilities. A field 
test comparing a 1-3 scale with a 1-4 scale by Godshalk 
et al. (1966) suggested some improvement in reliability with 
the 1-4 scale, but Coffman (1971b) indicated a preference for
an even greater range in scores. Large scale-ranges can be
simply many points on a holistic scale, or they can be
developed through analytic and atomistic scoring as in the 
Breland (1983) 3-15 range scale based on three subscales or 
the Moss et al. (1982) 0-20 atomistic scale. 

Once scores have been assigned, they may or may not
be adjudicated. Adjudication usually involves engaging an
additional reader to resolve a scoring discrepancy between
two other readers. Since highly discrepant scores are
eliminated through adjudication, reliabilities increase. Two
final procedural differences in direct assessments have to do
with the total number of readers engaged and the physical
context of their engagement. The more readers there are, the
more difficult training and instruction are. Consequently, it is
usually expected that reliabilities will be lower for a large
group of readers than for a very small operation. For
example, if only two readers are used, and if they are
carefully instructed and monitored, one would not expect
much difference in their judgments. The two readers may
also represent the same educational setting, such as an
English department, and thus the likelihood of agreement
may be quite high.

The other procedural difference has to do with the
setting in which the scores are generated. The most
common setting is the conference setting in which readers
are assembled at some central facility and supervised in
some way as they read. Another approach used less often is
what might be called the "remote" method in which readers
are not assembled but are mailed samples with written
instructions on scoring. At times, readers may be assembled
initially for instruction, but the actual reading is conducted
in their individual homes or offices and the materials
returned through the mail.

The reliability estimates reported in the studies of
Table 1 were generated through different statistical procedures.
Often, a simple correlation between reader scores
on a single topic is reported. At other times test-retest,
alternate forms, and other types of correlations are reported.
Coffman (1971a, 1971b) asserts that correlations at times
tend to overestimate reliabilities because they do not take
into account mean differences among scores. Analysis of
variance procedures are preferred, he observes. Similar
estimates are generated through confirmatory factor analysis
procedures, but these depend on the specific model
postulated for the analysis.
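
The point about mean differences can be made concrete with a small sketch that compares a product-moment correlation with an analysis-of-variance estimate on the same ratings. The data are hypothetical, and the particular ANOVA estimate shown (a one-way intraclass correlation for a single rating) is an assumption; the review says only that analysis of variance procedures are preferred, not which estimate Coffman used.

```python
# Illustrative sketch: Pearson correlation versus a one-way ANOVA
# intraclass correlation for two readers scoring the same papers.
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def anova_reliability(x, y):
    """One-way intraclass correlation for a single rating (two ratings per paper)."""
    n, k = len(x), 2
    grand_mean = (sum(x) + sum(y)) / (n * k)
    paper_means = [(a + b) / 2 for a, b in zip(x, y)]
    ms_between = k * sum((m - grand_mean) ** 2 for m in paper_means) / (n - 1)
    ms_within = sum((a - m) ** 2 + (b - m) ** 2
                    for a, b, m in zip(x, y, paper_means)) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


if __name__ == "__main__":
    # Hypothetical data: reader B ranks papers exactly as reader A does
    # but is two points more lenient on every paper.
    reader_a = [1, 2, 3, 4, 5]
    reader_b = [3, 4, 5, 6, 7]
    print(round(pearson(reader_a, reader_b), 3))
    print(round(anova_reliability(reader_a, reader_b), 3))
```

Because the correlation is unaffected by a constant shift in one reader's scores, it reports perfect agreement here, while the variance-based estimate counts the leniency difference as error and is much lower.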

All of the above differences in the ways direct
assessments are conducted and analyzed often combine to
produce unpredictable influences on reliability estimates
reported in the literature. In an attempt to gain some sense
of the magnitude of reliabilities that one might expect in a
given situation, estimates reported in the literature have
been assembled and identified as much as possible with
respect to procedures. A basic distinction made in
assembling these estimates has been between reading
reliability estimates and score reliability estimates.

Reading Reliability Estimates 

Reading reliability reflects error variance attributable to the 
inconsistencies among readers, but it does not reflect 
sampling error (the error introduced by providing only a 
limited opportunity to compose) or other sources of error. 
Reading reliability estimates will thus be inflated and cannot 
be used as an estimate of score reliability. Nevertheless, it is 
often useful to obtain an estimate of reading reliability as a 
gauge of the consistency of readers. When only one writing 
sample has been scored, it is not possible to estimate 
accurately anything but reading reliability. A comparison of 
reading reliability estimates obtained in a number of 
research investigations is presented in Table 2; estimates are
grouped with respect to the number of tasks scored and the 
number of ratings per task obtained. 

Overall median estimates of .64, .70, and .78 were 
computed and are given at the bottom of Table 2 for three 
common situations. Note that relatively low estimates were 
reported in the Coffman (1966) paper. The estimates in 
Table 2 range from a low of .39 (for one task rated by one 
reader) to a high of 88 (for three tasks rated by five 
readers). In two other papers, Coffman (1971a, 1971b) 
observes that the range of scores assigned, the number of 
readers, and the method of estimate used will all affect 
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Wt 2; Reading Reliability Estimate Reported for Direct Assessments 



tstiiffle 
Number 



Scale 



One Task 



Mings per Task 



1 2 3 4 5 



Two Tasks 



Mings 'per Task 



12 3 4 5 



Three Tasks 



Mings per Task 
1 2 3 4 5 



Ateju (1972) 




Not described 




.72 


.84 


.88 


Brebd 11983) 


1 
2 


Atomistic 
Analytic 


1-20 
3-15 


J& 
.6? 


.5? 
.SO 1 






3 


Holistic 


H 


.54' 


,70- 




Coffifian (1966) 




Holistic 


1-3 


39 


;56 


;65 


Coffman (1971a) 




Holistic 


i-4 




.70 




Conry and Jcroski ( 1980) 




Holistic 


1-9 






,74 b 


Coward (1952) 


1 


Holistic 


|:9 


.54' 


.69 1 






2 


Analytic 




.n 


.82' 




ETS(I982) 




Holistic 


1-4 




.71 




Finlayson(1951) 




Holistic 


1-20 


.11 


,83 


.88 


Hackman and Johnson (1977) 




Holistic 


1-5 


.61 






Huddleston (1954) 




Analytic 




■i 






Michael etal. (1980) 




Holistic 


14 


.66* 


.80 1 




Moss etal. (1982) 


1 

2 


Holistic 
Atomistic 


1-4 
0-20 








Myers et al. (1966) 




Holistic 


1-4 


,41 


.58 


.67 


Pbwills etal: (1979) 




Holistic 


14 






.81' 


Steele (1979) 


1 


Holistic 


1-5 












Analytic 


3-15 








Trader and Anderson 1 1935) 




Analytic 


1-10 




.94 








Analytic 


1-10 




.84 




Weiss and Jackson (1982) 




Aloraislic 


0-64 


;7l 


;55 






2 


Holistic 


1-6 


.80* 


.66 h 




Esttae Medians - 








.64 


.70 


;78 



,5! ,68 ;76 SI .84 



J)l .94 



;60 ;75 ;82 ;t 



,90 



'Average over 4 samples. 

"Average over 2 samples, narrative and expository tasks. 

Average over 4 tasks, 

'Average over 3 (asks and 3 samples, 



'Average over 8 conditions. 
'Average over 8 conditions, 
'Average over 8 conditions; 
Average over 2 conditions, 






estimates. With respect to range of scores, the suggestion
was made that the greater the range, the greater the variance
obtainable, and thus the greater the reliability estimate.
Table 2 supports such a speculation, since the low estimates
reported by Coffman (1966) were based on a score range of
only 1 (low) to 3 (high). Myers et al. (1966) used similar
methods with a 1-4 scale and obtained estimates similar to
those reported by Coffman (1966). In a hypothetical set of
data, Coffman (1971b) demonstrated that two ratings of the
same 25 papers correlated .87 when a 15-point scale was
used but only .1" when a 5-point scale was used. These
correlations, which represent reliability estimates for a
single task and one rater, also change when the 15-point
scale is cut in different places.

As noted previously, when the number of readers used
is large, it is more difficult to achieve consistency than when
the number is small (because it is easier to train and instruct
a small number). None of the studies in Table 2 examined
this issue specifically, but the magnitude of estimates is to
some degree associated with numbers of readers, where
such information is available. The Coffman (1966) estimates,
for example, are based on ratings by 25 different
raters; Finlayson (1951), in contrast, used only six raters.
Estimates based on product-moment correlations will also
tend to be higher than those based on analysis of variance,
because one set of scores may have a different mean than
another, and differences in means are not reflected in a
product-moment correlation. A comparison of the two
methods was made by Coffman (1971b) using his hypothetical
set of 25 essays. For the 15-point scale, the reading
reliability was .87 for the correlational method and .85 for
the analysis of variance method. No comparison was made
for the 5-point scale. The investigation of Michael et al.
(1980) summarized in Table 2 also computed reliability
estimates based on both methods, though the main object of
the study was to compare expert and lay readers. The two
types of estimates were quite close with the exception of one
comparison where the analysis of variance estimate was
somewhat lower.
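
The point about means can be illustrated with a small sketch. It assumes two hypothetical raters whose scores differ by a constant one point, and it uses a one-way intraclass correlation as a stand-in for the analysis of variance method. The studies reviewed here do not all say which ANOVA-based coefficient they used, so that choice, like the data and the function names, is an assumption made only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def icc_one_way(ratings):
    """One-way intraclass correlation for a single rating.

    ratings: one list per essay, each holding k ratings of that essay.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    means = [sum(row) / k for row in ratings]
    # Between-essay and within-essay mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for row, m in zip(ratings, means) for v in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical data: rater B is always one point more lenient than rater A.
rater_a = [1, 2, 3, 4, 5]
rater_b = [2, 3, 4, 5, 6]

r = pearson(rater_a, rater_b)
icc = icc_one_way([list(pair) for pair in zip(rater_a, rater_b)])
print(round(r, 3), round(icc, 3))
```

With these made-up scores the product-moment correlation is perfect, because the two raters rank and space the essays identically, while the ANOVA-based coefficient is lower because the one-point difference in means counts as disagreement.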

A fourth influence believed by many to be important is
the length of the essay. In Table 2, the longest writing time
reported in any of the studies is the 60-minute papers of
Finlayson (1951). The reading reliability estimates are
relatively high (.71 to .94), but these might be attributable to
the large range of scores (1-20), the use of the analysis of
variance method of assessment, the use of only six raters, or
the combination of all three of these factors. A comparison
of essay length (or time allowed) is possible within the
Coffman (1966) study and within the Weiss and Jackson
(1982) study. In the Coffman estimates, the suggestion is
that essay length is unimportant because the 20-minute
essays were estimated to have about the same reading
reliabilities as the 40-minute essays. In the Weiss and
Jackson study, a 40-minute essay had a slightly higher
reading reliability estimate (.68) than did the 20-minute
essay (.63). It is not clear from the studies listed in Table 2,
therefore, whether reading reliability is influenced by the
length of the essay or the time allowed to write it.

A fifth influence suggested by Coffman and others on
direct assessment reliabilities is the method of scoring.
Three of the studies in Table 2 allow for a comparison of
scoring methods. Coward (1952) compared scores on
responses to four different tasks that were scored both
analytically and holistically on a 1-9 scale. The analytical
scoring involved the rating and weighting of several
components, although the actual range of scores developed
was not given. The reading reliability estimates were higher
for analytic scoring for each of the four tasks analyzed.
Weiss and Jackson (1982) used both holistic and atomistic
scoring methods, and both holistic scorings yielded higher
reading reliability estimates than did the atomistic scoring.

In my own work (Breland 1983), I have conducted all
three types of scoring on the same set of 20-minute essays.
An atomistic scoring was conducted through a 20-element
checklist in which scorers checked specific attributes of
essays on a 5-point scale. The scores on each checklist item
were combined into an equally weighted sum to produce a
score range from 20 to 100. An analytic scoring was
accomplished by a different set of raters using a three-facet
skill rating, each on a 5-point scale. The three facets were
discourse quality, syntactic quality, and lexical quality, and
were based on an analysis of the 20-element checklist. The
analytic score was based on an equally weighted sum of the
three skill facets. Holistic scorings of the same essays were
also made on two different occasions by two different sets
of readers using a 1-4 scale. The results of scoring the same
essays three different ways lead one to the conclusion that
holistic scoring yields higher reliabilities than detailed
atomistic scoring, and it is also a great deal less tedious. On
the other hand, they indicate that a limited amount of analysis,
such as in the three-facet scoring, can produce reading
reliabilities higher than those obtained with holistic scoring.

The analytic ratings, of course, required more reading
time, but costs were minimized by conducting the reading
through the mail rather than in a conference setting. This
difference in mail versus conference reading suggests one
final influence on the reliability of readings. In a conference
setting, readers can discuss their ratings and be supervised
by table leaders and a chief reader. These influences have
been demonstrated through countless readings to result in better
reliabilities of scores. But it is possible that carefully
worded instructions sent through the mail can also result in
improved reading reliabilities. The suggestion of the results
from Table 2 is that carefully written instructions, when
combined with analytic scoring procedures, result in
improved reliabilities. Whether holistic scoring conducted
in a similar way would yield even higher reliabilities is not
known, but Table 2 indicates that analytic scoring tends
generally to produce the highest reading reliabilities when a
single task is being scored.
Table 3. Score Reliability Estimates Reported for Direct Assessments

One Task 

Scoring 



Estimate 
Number 



Werts et al. (1980)
Estimate medians 
Spearman -Brown estimates 



Two Tasks 



Study




Method 


Scale 


1 
1 




Breland and Gaynor ( 1979) 




Holistic 


1-6 




.51 


eremsbh(I978) 




Holistic 


1-4 




.55 


Coffman (1966)




Holistic


1-3 


.26 


.38 


Finlayson (1951)




Holistic 


1-20 


.69 


.78 


Moss et al. (1982)


1 


Holistic 


1-4 








2 


Atomistic 


0-20 






Quellmalz (1982) 


i 


Analytic 


1-4 








2 


Analytic 


1-4 






Steele (1979) 


1 


Holistic 


0-4 




.43 




2 


Holistic 


0-4 




.58 




3 


Analytic


VIS 






Traxler and Anderson (1935)




Analytic 


1-10







Ratings per Task 



3 4 



Ratings per Task



3 4 



.44 
.53 



.51 
55 

.42 .55 .62 Z-6 

.82 .88 .90 .91 

;46 

.73 



.58 
.73 

.60 

66 
.69 



.62 



Three Tasks 



Ratings per Task 



.76 
.82 



.70 
.76 



3 4 



;52 :65 :7! ;74 



.61 

.83 

.65 .70 



Score Reliability Estimates 

Table 3 provides a summary of 10 studies reporting
estimates of score reliabilities, estimates that include not
only reader inaccuracies but also error variance associated
with sampling. To develop such estimates, more than a
single task and more than a single reading are required.*
The most frequent type of estimate reported, as Table 3
shows, is that for two ratings per task, whether one, two,
or three tasks were rated. For these cases, medians of the
estimates are given at the bottom of the table. The median
estimates for two and three tasks, respectively, are slightly
less than what would be computed by the Spearman-Brown
formula using the .53 median estimate for a single task as a
base. This could mean that the .53 estimate is too high, or
that the estimates for two and three tasks are too low. The
low (.38) estimate made by Coffman was based on an
extension from a 5-task, 5-reading analysis of variance and,
additionally, is based on a 1-3 score scale, which probably
attenuated the base estimate. The next higher figure of .58
reported by Steele (1979) is based on unusually explicit
instructions and numerous prescored samples, advantages
readers usually don't have. Thus, the .53 median estimate
for the score reliability obtained when one task is scored by
two readers seems reasonable.

For three tasks and two ratings per task, a Spearman-
Brown estimate of .76 is higher than the median estimate of



*Note that multiple tasks may consist of multiple topics in the same
discourse mode, a single topic in different discourse modes, or multiple
topics in different discourse modes.



.70. The Steele (1979) generalizability coefficient estimate
was .65 for a 3-task, 2-rating situation, but it may have been
low because rating instructions were in the process of
development. After rating instructions were improved, the
generalizability coefficient increased to .76, the same as
the Spearman-Brown estimate.
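
The Spearman-Brown projection used in this comparison can be written as a short function. This is a minimal sketch: the function name is hypothetical, and the only value taken from the text is the .53 single-task median. The computed projections need not agree to the last digit with the figures quoted above, which may rest on unrounded medians.

```python
def spearman_brown(r, n):
    """Project a reliability r to a measure lengthened n times."""
    return n * r / (1 + (n - 1) * r)


base = 0.53  # median score reliability for one task, two ratings (Table 3)
for tasks in (2, 3):
    print(tasks, round(spearman_brown(base, tasks), 3))
```

Projecting from a single base value in this way assumes that the added tasks behave like parallel forms of the first, the same assumption that underlies the extended tables of estimates discussed in this section.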

A few studies have reported reliability estimates for
numbers of tasks or ratings in excess of those given in Table
3. These are of interest because they give some indication of
what accuracy one might expect if resources were available
to conduct such assessments. Table 4 gives estimates
reported in four studies. The Finlayson (1951) estimates for
two tasks and six raters exceed the score reliability attained
by many objective tests (and thus appear extreme).

Coffman (1966), using empirical estimates as a base, 
produced an extended matrix of reading and score 



Table 4. Reported Reliability Estimates Based on
Multiple Tasks or Ratings in Excess of Three

                               Tasks/Raters    Reliability Estimate
Study                          per Task        Reading     Score

Akeju (1972)                   1 / 7           .95
Coffman (1966)                 5 / 5           .92         .84
Diederich and Link (1967)      4 / 2                       .80
Finlayson (1951)               2 / 6           .96         .93
Steele (1979)                  6 / 2                       .75
                               6 / 3                       .79
                               9 / 2                       .79
                               9 / 3                       .83
Table 5. Past Estimates of Score and Reading
Reliabilities for Sets of Short Essays Read Holistically
on a 1-3 Scale

Number of           Type of                  Number of Tasks
Ratings per Task    Reliability      1      2      3      4      5

1                   Score           .26    .41    .52    .59    .64
                    Reading         .38    .51    .60    .66    .70
2                   Score           .38    .55    .65    .71    .75
                    Reading         .56    .68    .75    .79    .82
3                   Score           .44    .62    .71    .76    .80
                    Reading         .65    .76    .82    .85    .88
4                   Score           .49    .66    .74    .79    .82
                    Reading         .72    .81    .86    .88    .90
5                   Score           .52    .68    .76    .81    .84*
                    Reading         .76    .84    .88    .91    .92*
∞                   Content         .68    .81    .86    .89    .91

Source: Adapted from Coffman (1966).
*Based on empirical data.



to examine this matrix of estimates and to summarize the 
procedures Coffman used to generate it. The procedure 
begins with the 5-task, 5-rating cell based on empirical data 
and these assumptions:

1. The essay tasks are random samples from a pool of
tasks; consequently, the relationships among score
reliabilities as tasks vary in number are governed by
the Spearman-Brown formula.

2. The raters are selected at random and randomly
assigned to essays. Under these conditions it is also
assumed that the Spearman-Brown formula holds for
reading reliabilities as the number of readings
varies.

3. The relationship between reading and score
reliabilities is governed by the concept of "content"
reliability (Gulliksen 1950, 211-214), in
which content reliability remains constant as the
number of readings changes. Content reliability is
computed as the ratio of the score to reading
reliability.
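
The way the three assumptions fit together can be sketched in a few lines. This is an illustration under stated assumptions, not Coffman's own computation: it runs forward from the 1-task, 1-rating values of Table 5 (.26 score, .38 reading) rather than backward from the empirical 5-task, 5-rating cell, it applies the Spearman-Brown formula to content reliability across tasks (a reading of the content row of Table 5), and the function names are hypothetical.

```python
def spearman_brown(r, n):
    """Project a reliability r to a measure lengthened n times."""
    return n * r / (1 + (n - 1) * r)


def reliability_cell(reading_11, score_11, tasks, ratings):
    """Return (score, reading, content) reliabilities for one matrix cell.

    reading_11, score_11: reliabilities for one task read once.
    """
    # Assumption 3: content reliability is score / reading and does not
    # depend on the number of readings.
    content_1 = score_11 / reading_11
    # Assumption 2: reading reliability for one task grows with ratings.
    reading_1k = spearman_brown(reading_11, ratings)
    # Assumption 3 again: score reliability for one task, k ratings.
    score_1k = reading_1k * content_1
    # Assumption 1: score reliability grows with the number of tasks.
    score_tk = spearman_brown(score_1k, tasks)
    # Content reliability across tasks, projected the same way.
    content_t = spearman_brown(content_1, tasks)
    reading_tk = score_tk / content_t
    return score_tk, reading_tk, content_t


score, reading, content = reliability_cell(0.38, 0.26, tasks=5, ratings=5)
print(round(score, 2), round(reading, 2))
```

Run this way, the 5-task, 5-rating cell comes out close to the empirical score and reading reliabilities of .84 and .92, which suggests the sketch captures the structure of the matrix even though the direction of the computation is reversed.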

Using these assumptions, it is possible to start at the 5-task,
5-rating cell (based on empirical data) and complete
the entire matrix as shown in Table 5. The further one
proceeds from the empirical base, of course, the less
confidence one has in the estimates made. In the 1-task,
1-rating cell, for example, the estimates would be expected to
be less accurate. Unfortunately, it is at the low end of the
matrix (few tasks and few ratings) where most assessments
are made. As a result, it would be of value to have better
estimates for those situations. Moreover, since the Coffman
(1966) estimates were based on an extreme scoring scale
(only 1-3 points), they are not generally applicable. One
approach to better estimates would be to use the Coffman
procedure, but to use as a base empirical evidence more
generally applicable and to start at the opposite end of the
matrix (few tasks and few raters). Median estimates from
Tables 2 and 3 can be used as an empirical base. For the
1-task, 2-rating cell, good median estimates are available for
both reading and score reliabilities. For the 1-task, 1-rating
cell and for the 1-task, 3-rating cell, Table 2 provides
reasonably stable reading reliabilities.

Table 6 shows the matrix of reading and score
reliabilities developed using the Coffman procedure, the
indicated empirical bases, and some of Coffman's assumptions.
The second assumption, that the Spearman-Brown



Table 6. New Estimates of Score and Reading Reliabilities for Various Combinations of Tasks and Ratings per Task

Number of                                       Number of Tasks
Ratings per Task                          1      2      3      4      5

1                 Score reliability      .48    .65    .74    .78    .83
                  Reading reliability    .64*   .76    .82    .85    .88
2                 Score reliability      .53*   .70    .76    .81    .85
                  Reading reliability    .70*   .81    .85    .88    .90
3                 Score reliability      .59    .75    .81    .84    .88
                  Reading reliability    .78*   .87    .90    .91    .94
∞                 Content reliability    .76    .86    .90    .92    .94

Note: The Spearman-Brown formula is

$$r_{nn} = \frac{n\,r_{11}}{1 + (n-1)\,r_{11}}$$

where $r_{nn}$ = the estimated coefficient,
$r_{11}$ = the original coefficient, and
$n$ = the number of times a test is lengthened.
*Based on the empirical data of Tables 2 and 3.
can be used to increase reading reliabilities as the number of
ratings increases, was not used. Such a table of estimates
can be only a rough guide to the magnitude of reliabilities
one might expect in a given situation, of course. More
precise estimates would recognize the specific effects on
reliability noted previously, namely, scoring scale range,
number of readers to be trained, and other factors.
Additionally, the greater the sampling from the various
stimulus and discourse modes, the greater the reliability one
would expect. A comparison of Tables 5 and 6 suggests that
Coffman's extended estimates were somewhat lower than
would usually be obtained, and that his estimates nearer to
his empirical base were slightly low.

Reliabilities of Analytic Subscales 

Several of the studies summarized in the previous sections 
also examined analytic subscales. In Table 7, six studies are 
summarized in which reliabilities were reported either for 
separate analytic subscales or for an overall score derived 



Table 7. Reliabilities Reported for Analytic Subscales



Breland (1983) 

Discourse quality

Syntactic quality 

Lexical quality 

Total of subscales 

ECT Holistic Score 
Conry and Jeroski (1980) 

Organization 

Sentence structure 

Spelling 

Handwriting 

Vocabulary 

Punctuation 
Diederich and Link (1967) 

Ideas 

Organization 
Wording 
Flavor 
Usage 
Punctuation 
Spelling 
Handwriting 
Hackman and Johnson ( 1977) 
Mechanics (subsentence level) 
Mechanics (sentence level) 
Organization 
Thought 
Style 

Overall quality 
Quellmalz, Capell, and Chou (1982)
Focus 

Organization 
Support 
Mechanics 
Steele (1979) 
Language 
Organization 
Audience
Total analytic 
Holistic 

'Adjudicated scores 
b 12th grade, narrative
c 12th grade, expository 



Number of 
Tasks 



Ratings per 
Task 



Reading Reliability 
Estimates 



#1 #2 #3 #4



Score 
Reliability 
Estimates 



.69 .74' 
.70 .75' 
.71 .74' 

:78 :82' 

.76' 



.32 b 
.47 
.46 
.47 
.59 
.28 



.55 c 
.62 
.53 
.55 
.57 
.52 



.56 d 

.63 

.61 

.51 

.76 



.66 e 
.51 
.71 
65 

.58 



.83' 
:81' 
.64' 
66' 
.70* 
.61' 



2 
2 
2 
2 
2 
2 
2 

2 J 

2 
2 
2 
2 
2 
2 

2 
2 
2 
2 



2 
2 
2 
2 

2 

d 8th grade, narrative
e 8th grade, expository 

'Overall reliability of composite of analytic scores 



.80 + ' 



.83 
.74 
.48 
;82 
.76 



from the analytic subscales. The best known of these
analytic scoring schemes is that of Diederich and Link
(1967). The construction and use of these subscales are
described in Diederich (1974), and the factor analysis from
which they were derived is reported by French (1962). A
total score is derived from the eight scales by rating each on
a 1 (low) to 5 (high) scale, doubling the weight for ideas and
organization, and summing. Thus the total score can range
from 10 to 50. Diederich and Link (1967) report that this
cumulative total of eight ratings, when applied independently
to four different papers, results in a score reliability
of .80 or more. The average reading time per paper is about
5 minutes.
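
The weighting rule just described is simple enough to check directly. The following sketch assumes the eight scale names listed for Diederich and Link (1967) in Table 7. The function and variable names are hypothetical, and the code illustrates the arithmetic rather than any published scoring program.

```python
SCALES = [
    "ideas", "organization", "wording", "flavor",
    "usage", "punctuation", "spelling", "handwriting",
]
DOUBLED = {"ideas", "organization"}  # these two scales count twice


def diederich_total(ratings):
    """Weighted sum of eight 1-5 ratings for a single paper."""
    total = 0
    for scale in SCALES:
        value = ratings[scale]
        if not 1 <= value <= 5:
            raise ValueError(f"{scale} rating must be between 1 and 5")
        total += value * (2 if scale in DOUBLED else 1)
    return total


lowest = diederich_total({scale: 1 for scale in SCALES})
highest = diederich_total({scale: 5 for scale in SCALES})
print(lowest, highest)
```

Six scales at single weight and two at double weight give ten weighted units, so the total runs from 10 (all ratings of 1) to 50 (all ratings of 5).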

The analytic subscales of Conry and Jeroski (1980),
Hackman and Johnson (1977), and Quellmalz et al. (1982)
are somewhat similar to the Diederich subscales. All have
organization as one subscale, and all have mechanics,
either as a subscale or as represented by specific aspects of
mechanics. The Steele (1979) subscales are different in that
they don't attend to mechanics at all, except as it relates to
language. The stimulus was aural (taped) rather than
written, and the examinee was required to consider audience
and purpose as important. Each element was rated on a 0-4
scale. The use of audience as a subscale is of particular
interest because such a scale allows for an evaluation of
skills relating to audience issues and (implicitly) issues of
purpose. But the score reliability obtained for the audience
subscale (.48) was disappointing, suggesting that such a
factor is difficult to score.

These analytic approaches tend to be limited also
because they focus only on parts of the total domain of
interest. Since there are numerous aspects of writing skill,
and since these vary from one mode of discourse to the next,
it is usually assumed that only a few aspects can be rated.
And when only a few characteristics are rated, there is
always the possibility that something important may have
been overlooked or that one element may receive more
weight than it merits. Of course, these limitations of
analytic scoring, as well as the added time it takes, are
the principal arguments for holistic scoring.

The Breland (1983) scales represent a compromise
between analytic scoring as it is usually conducted and
holistic scoring. While empirically based, these scales do
not represent an extraction of factors as in the Diederich
approach. Such factor analytic approaches are limited
because (1) they are appropriate only for the particular
discourse mode used for the factor analysis, and (2) they do
not cover the entire domain of skills, as does holistic
scoring. As a compromise, the Breland (1983) scales might
be more aptly labeled "focused holistic scales." That is,
they focus on three distinct qualities of writing, but in doing
so they do not exclude any specific characteristics. They
represent a dividing up of holistic scoring into three
domains. Because nothing is excluded, the scales can be
applied to samples from any mode of discourse, provided
that each subscale is appropriately defined. For example, in
an argumentative mode of discourse, the scale "discourse
quality" would include an evaluation of the degree to which
supporting evidence is used, but in a narrative-descriptive
mode, use of supporting evidence would not be evaluated as
a part of discourse quality because no argument is being
made.

Summary of Reliability Evidence 

The reliability of direct assessments of writing skill is
limited primarily by measurement errors resulting from
reader inconsistencies, content sampling biases, and
interactions between these two sources of error. Reliability
estimates found in the literature are influenced by the
population studied, the number of cases examined, task
type, number of tasks, number of readers, time allowed,
scoring method used, and scoring range. The most
important influences appear to be number of tasks, number
of raters, scoring method, and scoring range. Considering
only number of tasks and number of ratings per task, it can
be expected that score reliabilities will range from about .50
(for one task and one rater) to about .90 (for five tasks and
three ratings per task). Higher scoring ranges, up to about 15
judgmental points, seem to generate slightly higher
reliabilities. Analytic scoring methods with a limited set of
scales may produce higher reliabilities than holistic scoring,
though detailed analysis using many scales (atomistic
scoring) appears to yield the lowest reliabilities.



VALIDITY OF DIRECT ASSESSMENTS 

Validity is often considered with respect to several specific
procedures used in the process of examining measures:
concurrent validation, predictive validation, incremental
validation, validation of subscores, content validation, and
construct validation. With the exception of the last two
procedures, the methods used are essentially correlational
methods. That is, a criterion of some type is correlated with
the measure being examined. In content validation, a
systematic examination of the content of a test is made to
determine the degree to which it samples the skill purported
to be measured. Construct validation requires an examination
of the degree to which an assessment measures some
theoretical construct, or trait. Construct validation involves
the gradual accumulation of evidence from a number of
sources including correlational evidence, internal consistency,
the influence of instructional interventions, and any
other available sources. The following sections report
evidence for direct assessments for various types of validity.



Concurrent Validity 

Table 8 summarizes five studies in which some direct 
measure of writing skill was correlated with a criterion 
measure at or about the same time. The most common
criterion used in these studies was the high school GPA, but
concurrent correlations are also shown for high school and
college grades in English composition courses, for high
school instructors' ratings of writing ability, and for more
reliable direct assessments of the same type being validated.

The validity evidence reported by Coffman (1966) is
different conceptually from that of the other studies in Table
8. The criterion variable was the sum of scores obtained
from four different essay tasks, each scored independently
by four different raters. As a result, the criterion was based
on 16 independent judgments and had a score reliability
estimated at .79. This is a relatively high reliability for
direct assessment; moreover, the single essay being examined
for validity was similar to the criterion essays and was
scored in the same way. The correlation of .56 obtained is
therefore not surprising, nor is the fact that it is the highest of
any of the correlations in Table 8.

The earliest study of this type reviewed was that of
Huddleston (1954). For the 763 high school students studied,
two criterion variables (average high school English grade
and an instructor's rating of their writing ability) were
accessible. An essay score was the total of two judgments
(content and style) of a sample of writing (approximately
150 words) made by each of two English teachers. This
essay score was found to correlate .43 and .41, respectively,
with high school English grades and high school instructors'
ratings of writing ability.

The concurrent validity comparisons in the Breland
(1977) study were based on the criteria of high school rank
(self-reported), high school English grades (self-reported),
college freshman English grades (fall), and college freshman
English grades (spring). The relationships between the
essay pretests (administered in college English courses) and
both high school rank and high school grades were .37. The
relationship with college grades was much lower, .23. The
smaller correlation with college grades may have been a
result of instructional influences, or of the probably lower
reliability of English grades as compared to high school
rank. In any event, some correlation with course grades
would be expected because the essays were written toward
the end of courses.

The Hackman and Johnson (1977) study, reported in
Table 8, used high school GPA as the criterion and a holistic



Table 8. Studies Reporting Concurrent Correlations with Direct Measures of Writing Skill 







Direct 






Study 




Measure of 






and Setting 


N 


Writing Skill


Criterion Measure 


Correlation 


Breland (1977) 


799 


Fall essay pretest


High school rank 


.37 


College 


756 


Fall essay pretest 


Last high school English grade


.37 


freshmen 


878 


Fall essay pretest 


Fall English grade


.23 




491 


Spring essay posttest 


Spring English grade 


.23. 


Breland (1983) 


800 


ECT holistic score


Last high school English grade 


.20" 


College 






High school rank 


.18' 


applicants _ _ 










Coffman (1966) 


296 


One essay scored by 


Four essays scored by 


.56 


High school 




two readers 


four readers 


students 








.20 


Hackman and Johnson (1977) 


36 


Fall essay pretest 


High school GPA 


Yale freshmen 








.43


Huddleston (1954) 


763 


Essay score 


High school English grades 


High school 


763 


Essay score 


Instructor's rating of writing ability


.41


students 








.40 


Michael et al. (1980) 


100


30-minute essay 


Cumulative college GPA


College juniors 


(first 
sample) 


(expert readers) 
30-minute essay










(lay readers) 




.36 




100


30-minute essay 




.05 




(second 


(expert readers) 






sample) 


30-minute essay










(lay readers) 




.06 


Michael and Shaffer (1978)


687 


45-minute essay 


High school GPA 


.15 


High school 


656 


In-class essay 




.17 



students 



Median correlation 



■Median over four samples 



.23 



score on a 40-minute essay read independently by two
readers. The relatively low correlation of .20 may be related
to restriction of range, because all subjects had been
admitted to Yale University. Most had very good high
school records.

The Michael and Shaffer (1978) investigation also used
high school GPA as the criterion. The validity correlations
reported, .15 and .17, are similar to the .20 figure reported
by Hackman and Johnson for Yale students, even though the
California State University and Colleges (CSUC) sample was
not restricted in its range of abilities.

In the Michael et al. (1980) study, two random samples
of approximately 100 college juniors each wrote 30-minute
essays on two different topics. Each response was rated
both by English professors (experts) and by professors in other
departments (lay readers). Two of each type of reader read
each essay, and the total score was obtained by adding the
two ratings. The criterion measure was the cumulative GPA
of each student up to the time of the investigation. For the
first sample, the observed correlations between reader
scores and GPA were better (.41 and .40) than those for the
second sample (.05 and .06). Small differences in
reliabilities of ratings favored the expert readers, but these
differences were not considered important ones. The main
differences between the first sample and the second sample
data were in the writing tasks, though the details of the tasks
used were not reported. It was suggested that the specific
topic of an essay, or the specific writing task required, may
have a substantial bearing on the validity of an assessment.

Predictive Validity 

While the concurrent correlations just reviewed are predictive 
in a sense, the usual interest is in examining how well a 
measure predicts some event which occurs at a later time. In 
the case of writing skills, therefore, we want to demonstrate a 
relationship, for example, between a precourse test and a 
course grade, between a preadmission test and gpa after 
admission, or between writing skill as assessed at one time and 



writing skill as assessed at a later time. Table 9 presents results 
from four studies that have reported such relationships.

The Breland (1977) and Michael and Shaffer (1978)
studies, reviewed earlier for concurrent correlations, also
examined data on student English course grades and on
writing samples collected toward the end of courses. The
Werts et al. (1980) article represented a refinement of the
same data of the Breland (1977) study through analyses of a
complete but smaller data sample. As in the concurrent
correlations of Table 8, the relationships between writing
sample scores obtained at different times are higher than
relationships between writing sample scores and later course
grades. The direct measures correlate with each other about
at the level of their score reliability (about .50 in this case),
but they are not highly predictive of performance either in
English courses or overall.

Incremental Validity 

Because of the expense of direct assessments of writing
skill, a central issue over the years has been whether or not
an essay adds significantly to the measurement accuracy
provided by other available measures: the high school
record, objective test scores, or other information. Despite
the importance of this issue, it has not often been examined.
Table 10 gives the results from five studies that have in some
way provided useful evidence.
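
The idea of an increment can be made concrete with a two-predictor sketch. It assumes that the increment is the difference between the multiple correlation with and without the essay, and it uses the standard formula for the multiple correlation of a criterion with two predictors. The three correlations below are hypothetical values chosen for illustration, not figures from any of the studies reviewed, and the function names are likewise hypothetical.

```python
from math import sqrt


def multiple_r_two_predictors(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlations of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return sqrt(r_squared)


def incremental_r(r_y1, r_y2, r_12):
    """Gain in multiple R from adding predictor 2 to predictor 1 alone."""
    return multiple_r_two_predictors(r_y1, r_y2, r_12) - r_y1


# Hypothetical correlations: objective test with criterion, essay with
# criterion, and objective test with essay.
full_r = multiple_r_two_predictors(0.50, 0.40, 0.50)
gain = incremental_r(0.50, 0.40, 0.50)
print(round(full_r, 3), round(gain, 3))
```

Because the essay and the objective measure overlap, the gain in multiple correlation is much smaller than the essay's own correlation with the criterion, which is why the incremental contribution of an essay is a question separate from its validity taken alone.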

The Breland and Gaynor (1979) study considered the
effect of adding an essay when already available were high
school rank (self-reported), last high school English grade
(self-reported), SAT-verbal score, and TSWE score. Two
criteria were used: freshman English composition course
grade and a postcourse essay assessment consisting of the
sum of scores received on essays written toward the end of
both the fall and spring semesters. The grade criterion was
examined within each of four colleges; the essay criterion
was examined for all four colleges combined. Significant
beta weights were obtained for the essay pretest in all four
colleges combined when the essay criterion was used. The



Table 9. Studies Reporting Predictive Correlations for Direct Measures of Writing Skill 



Study 



and Setting 


N 


Predictive Measure 


Criterion Measure 


Correlation 


Breland (1977) 


886 


Fall essay pretest 


Fall English grade


.28 


Four colleges 


400 


Fall essay pretest 


Spring English grade 


.26 




904 


Bill essay pretest 


Fall essay posttest 


.52 




316 


Fall essay pretest 


Spring essay posttest 


;5i 


Michael and Shaffer ( 1978) 




EPT essay 


Fall GPA 


.21 


California State 


6J7 


EPT essay 


Fall English grade 


.31 


University, Northridge 


657 


In-ciass essay 


Fall GPA 


.25 




604 




Fall English grade 


.32 


Werts et al. (1980) 


234 


Fall essay pretest 


Rill essay posttest 


.56 






Rill essay pretest 


Spring essay posttest 


.57 
Table 10. Studies Reporting Incremental Validity Evidence for Direct Measures

                                                                                                                     Incremental
Study                             N                Criterion                Predictors            r     beta   R     R (direct)

Breland and Gaynor (1979)         76               Freshman English         HS rank                     .10    .39   .04
  College freshmen                (College A)      course grades            HS English grade            Al
                                                                            SAT-V                       .00
                                                                            TSWE                        .10
                                                                            Essay pretest               .20

                                  160                                       HS rank                     .04    .43   .04
                                  (College B)                               HS English grade            .28
                                                                            SAT-V                       .00
                                                                            TSWE                        .05
                                                                            Essay pretest               .22

                                  204                                       HS rank                     .20    .51   .03
                                  (College C)                               HS English grade            .00
                                                                            SAT-V                       .25
                                                                            TSWE                        .U3
                                                                            Essay pretest               .22

                                  135                                       HS rank                     .25    .50   .02
                                  (College D)                               HS English grade            .13
                                                                            SAT-V                       .00
                                                                            TSWE                        .13
                                                                            Essay pretest               .19

                                  213              Postcourse essay         HS rank                     M      .76   .05
                                  (Four colleges)  assessment               HS English grade            .09
                                                                            SAT-V                       .16
                                                                            TSWE                        .22
                                                                            Essay pretest               .38

Checketts and Christensen (1974)  123              Freshman English GPA     CLEP objective                     .53   .06
  CLEP examinees                                                            CLEP essay

Godshalk et al. (1966)            237              Four brief essays,       PSAT-V                .69   .28    .77   .02b
  High school students                             each read 5 times        Sentence correction   .67   .27
                                                                            Prose groups          .56   .13
                                                                            Essay A (2 readings)  .55   .26

                                  254                                       PSAT-V                .63   .20    .75   .03b
                                                                            Sentence correction   .68   .36
                                                                            Prose groups          .56   .15
                                                                            Essay B (2 readings)  .49   .23

Huddleston (1954)                 420              Average English grade    Objective English     .60   .18    .80
  High school students                                                      Essay content         .26   .02
                                                                            Essay style           .39   .10
                                                                            Paragraph A           .29   .03
                                                                            Paragraph B           .33   .08
                                                                            Verbal test           .77   .58

                                                   Instructor's rating      Objective English     .58   .16    .79
                                                   of writing ability       Essay content         .22  -.03
                                                                            Essay style           .39   .13
                                                                            Paragraph A           .26   .00
                                                                            Paragraph B           .33   .09
                                                                            Verbal test           .76   .60

                                  763              Average English grade    Objective English     .34          .56   .07a
                                                                            Two essay ratings     .43

                                                   Instructor's rating      Two paragraph ratings .34          .56   .05a
                                                   of writing ability       Two essay ratings     .41

Michael and Shaffer (1978)        1583             Fall GPA                 HS GPA                --    .25    .38
  College freshmen                                                          EPT-reading                 .11
                                                                            EPT-essay
                                                                            EPT-sent. constr.           ns
                                                                            EPT-logic & org.            .06

                                  637              Freshman English grade   HS GPA                      .23    .48
                                                                            EPT-reading                 .15
                                                                            EPT-essay                   .12
                                                                            EPT-sent. constr.           .18
                                                                            EPT-logic & org.            ns

Note: A dash (--) indicates information not reported.

a Includes two essays and two paragraph ratings.

b This increment is based on a comparison with prediction by four objective tests. Actually, one objective test was replaced by an essay in conducting the
study. Consequently, the increment attributable to the essay is slightly larger than the figure reported here, but the precise amount is unknown.



average increment in the multiple correlation, attributable to 
the essay, was about .04. 
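
The figure of about .04 can be checked against Table 10, on the assumption that it is the simple mean of the five incremental R values listed there for Colleges A through D and for the four colleges combined. The variable names below are illustrative only.

```python
from statistics import mean

# Incremental R (direct) for Colleges A-D and the four colleges combined,
# as read from Table 10.
increments = [0.04, 0.04, 0.03, 0.02, 0.05]

average_increment = mean(increments)
print(round(average_increment, 2))
```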

Checketts and Christensen (1974) studied the CLEP 
objective and essay components and obtained an increment 
in the multiple correlation predicting a freshman English 
average of .06 owing to the essay. The CLEP essay and 
objective components are each 90 minutes in length, so the 
results are not precisely comparable to the more common 
20-minute essay and somewhat shorter objective compo- 
nent. But the similarity of the .06 increment to the .04 
increment indicated in the Breland and Gaynor study would 
suggest that not a great deal is gained by the longer essay. 

The Godshalk et al. (1966) study has been cited on a 
number of occasions in this report. The incremental validity 
evidence reported in Table 10 was developed in a special 
field trial in which four of the five essays used were criteria 
and the fifth was a predictor. Two different essay topics were 
used as predictors, Essay A and Essay B. The criterion thus 
excluded either Essay A or Essay B. As noted in Table 10, 
the incremental R observable in the Godshalk et al. study 
was the difference between the R obtainable from four 
objective predictor tests and the R obtained when one of the 
four objective tests was replaced by an essay test. Thus the 
incremental R shown is attenuated by some unknown 
amount. Another possible comparison is between an 
objective test prediction using three objective tests of 
composition (but excluding the PSAT-verbal) and the Table 
10 multiple Rs. Such a comparison tends to artificially 
inflate the increment, but the values obtained are .05 and 
.04 respectively for essays A and B. The true increment lies 
between these figures and those shown in Table 10. 

The Huddleston (1954) study reported that a verbal test 
(essentially the SAT-verbal) accounted for practically all of 
the variance in both of the criteria: average high school 
English grade and high school instructors' ratings of writing 
ability. A multiple correlation of .80 was obtained for the 
prediction of average high school English grades from an 
essay (rated for both content and style), two paragraph 
revision exercises, an objective test of English, and the 
verbal test. But the verbal test alone correlated .77 with the 
criterion, indicating that all other variables, including the 
essay test, the paragraph revision exercises, and the 
objective test of English, added little (.03) to the prediction. 
A similar result was obtained when the criterion was 
instructor rating of writing ability. The essay style rating 
contributed more to the prediction than the content rating, 
suggesting that content was less reliably assessed. 

The final study of Table 10, that of Michael and Shaffer 
(1978), also used two criteria. The first criterion was fall 
semester GPA and the second, grades in a freshman English 
course. Significant beta weights were obtained for the 
40-minute EPT essay (scored by two readers) for both criteria. 
Incremental multiple correlations comparable to other 
studies in Table 10 were not reported, but some were. For 
example, the summation of the EPT composition components 
(sentence construction, logic and organization, and 
the essay) predicted fall semester GPA with an r = .29, 
whereas sentence construction correlated .27 and logic and 
organization .26 with the same criterion. For predicting 
grades, the summation of the three composition scores 
produced a correlation of .41, whereas sentence construction 
and logic and organization correlated respectively 
.38 and .33 with the criterion. 



Validity of Analytic Subscores 

Recent interest in diagnosis calls for an examination of 
validity evidence reported for analytic subscores in direct 
assessments. Although analytic scales are often used in 
developing scores for direct assessments, data are not often 
reported for them. Some reliability data for analytic 
subscales were described previously in Table 7. Table 11 
summarizes three investigations in which some kind of 
correlational validity evidence was reported for an analytic 
subscore. The studies by Hackman and Johnson (1977) and 
Huddleston (1954) are in some senses similar because of the 
high school grade criterion and the types of subscales used. 
In both, style appears to be a more valid subscore than 
content (thought in Hackman and Johnson). However, 
grammar in the Huddleston study had the highest validity (r 
= .49 with instructor rating of writing ability). The 
generally lower correlations in the Hackman and Johnson 
study are probably attributable to the select sample (Yale 
freshmen) being studied. 

The Breland (1983) data also show slightly higher 
validities for grammatical types of ratings as opposed to 
higher order skills. For both criteria, a syntactic quality 
rating and a lexical quality rating yielded higher correlations 
than a rating on discourse quality. The discourse quality 
rating reflected qualities similar to the organization, 
thought, and content subscores reported for other studies in 
Table 11. Despite the importance that most observers, 
including members of the English teaching profession, place 
on discourse, thought, content, organization, and similar 
qualities, the validity evidence shown in Table 11 favors the 
more mundane skills. 

Construct Validity 

Quellmalz et al. (1982) have recently revived issues of 
construct validity in writing skill assessment. One construct 
validity issue was the long-standing question of whether 
direct and indirect assessments both measure a unitary trait 
that is not easily divisible. Most past research has concluded 
that direct and indirect assessments are highly correlated, 
even if it could not be demonstrated conclusively that they 
measured the same underlying trait (Huddleston 1954; 
Breland and Gaynor 1979; Coffman 1966; Werts et al. 
1980). While Quellmalz et al. were not able to answer the 
question unequivocally, their results indicated that indirect 
assessments, as well as different types of direct assessments, 
measure different skill constructs. In particular, 
discourse mode (for example, expository, narrative) and 
response mode (production vs. recognition) were suggested 
as influences on the assessment. The study also compared 
analytic judgments of essays with objective assessments of 


Table 11. Validity Evidence for Analytic Subscores

Study and Setting           N     Subscore                  Criterion Measure             Correlation

Breland (1983)              800   Discourse quality         Last high school              .19 (.20)*
  Random samples                  Syntactic quality         English grade                 .26
  of ECT-takers                   Lexical quality                                         .24
                                  Analytic total                                          .26

                                  Discourse quality         High school rank              .17 (.18)*
                                  Syntactic quality                                       .21
                                  Lexical quality                                         .20
                                  Analytic total                                          .22

Hackman and Johnson (1977)  173   Mechanics (subsentence)   High school grade average     .20
  Yale college freshmen           Mechanics (sentence)                                    .22
                                  Organization                                            .19
                                  Thought                                                 .19
                                  Style                                                   .27

Huddleston (1954)           294   Punctuation               High school English grades    .25
  High school students            Idiom                                                   .21
                                  Grammar                                                 .33
                                  Sentence structure                                      .33

                                  Punctuation               Instructor rating of          .2P
                                  Idiom                     writing ability               .22
                                  Grammar                                                 .49
                                  Sentence structure                                      .36

                            763   Content                   High school English grades    .28
                                  Style                                                   .40

                                  Content                   Instructors' ratings of       .24
                                  Style                     writing ability               .39

*The discourse ratings were similar in emphases to the holistic ratings made for the ECT administration. Correlations between the 
criterion and the ECT holistic ratings are given in parentheses. 
parallel skills: focus, organization, support, and mechanics. 
Their analyses indicated that focus and organization 
defined a single factor (termed coherence), but that support 
and mechanics were distinct factors measurable by both 
direct and indirect methods. 



Content Validity 

No analyses of the content validity of direct assessments 
were encountered in the present review. The analyses of 
Quellmalz et al. (1982) touched on content validity, 
however, since different modes of discourse were examined. 
In that study, students who scored high on narrative tasks 
were not the same students who scored high on expository 
tasks. These results suggest that content sampling is 
important in direct assessment. Beyond the influence of 
discourse mode, the specific topic of the direct assessment 
may have additional influences. All students do not have 
equivalent knowledge about all topics. Direct assessments 
in which a single topic and a single discourse mode are used 
clearly are limited in content validity. 

Summary of Validity Evidence 

Evidence in support of the validity of direct assessments of 
writing skills is available from several perspectives. 
Concurrent correlations with high school rank, high school 
English grades, instructor rating of writing ability, and 
college GPA all showed statistically significant relationships, 
though these correlations were at times relatively low. 
Predictive correlations with college English grades and GPA 
were similar in magnitude, although also significant 
statistically. Incremental validity evidence was reported in a 
number of studies, showing that direct assessments of 
writing skill contribute information beyond that available 
through previous academic records and other kinds of test 
scores. In those few investigations reporting validity 
evidence for direct assessments of writing subskills, ratings 
of grammatical skills tended to yield slightly higher validity 
coefficients than ratings of content, discourse quality, or 
thought. The only type of validity evidence not located for 
direct assessments was evidence of content validity. Since 
only one writing task was often employed, content sampling 
from the domain of all possible writing tasks was of course 
severely limited. 



TECHNOLOGICAL DEVELOPMENTS 

Recent technological developments in text processing may 
afford an opportunity to improve direct assessments of 
writing skill. There is hope that the present impasse between 
the unreliability of the usual assessments and the labor 
intensiveness of more reliable and valid assessments can be 
broken by appropriate applications of technology. Past 
solutions to this dilemma have relied on multiple-choice 
assessments as a source of reliability and brief judgmental 
assessments as a source of validity. Few accept such a 
combination as the ultimate solution. Most multiple-choice 
assessments cover only a narrow range of the writing skill 
domain, and most judgmental assessments are made on one 
sample written in one mode of discourse. The limitations of 
current direct assessments are almost always a consequence 
of the labor intensiveness of better direct assessments. 

The use of technology in writing assessment is not a 
new idea. In an extensive project conducted for the U.S. 
Office of Education, Page and Paulus (see Page 1966, 1968a, 
1968b for summarizations of this work) developed techniques 
for scoring essays and for providing instructional 
feedback to students through computer analysis of essays. 
Indices were developed that predicted judgmental scores 
through a procedure adapted from Diederich's (1974) 
analytic scoring procedure. The computer was shown to be 
about as good a predictor of human judgments as human 
judges themselves. In view of the time that has now passed, 
however, the optimism expressed by Page (1966, 238) was 
clearly excessive: "We will soon be grading essays by 
computer, and this development will have astonishing 
impact on the educational world." [Page's emphasis] This 
statement was in error, at least with respect to the word 
soon. 

One of the reasons Page's work did not catch on was 
the English profession's negative response (see, for exam- 
ple, Macrorie 1969). Although it has been pointed out that 
some of the negative reactions to Page's work miss the point 
(Slotnick and Knapp 1971; Slotnick 1972), that what is being 
studied are the cognitive processes of experienced English 
teachers, these assertions have not sufficed to revive the 
idea. The limitations of the technology of the late sixties and 
its consequent lack of availability also caused the idea of 
computer assessment of writing to founder at that time. 
Recent strides in microchip technology and widespread 
acceptance and use of text processing procedures have 
changed the context in which technology operates. Despite 
this changed context, most English teachers and most 
examinees would probably never accept a computer's 
judgment of the quality of a piece of writing. On the other 
hand, descriptions, counts, and other computer-generated 
information that is useful but not evaluative would likely be 
more acceptable. 

An example of such descriptive information is 
provided by the Writer's Workbench program developed at 
Bell Laboratories (Frase 1980; Frase et al. 1981). The 
Writer's Workbench consists of a growing set of computer 
aids for editing and reformatting written documents. In 
addition to simple programs that check spelling and 
punctuation, included also are more complex routines that 
flag poor diction and weak phrases, and others that compute 
readability indices, compute the total number of unique 
words used, and compare a written piece with some 
standard piece written by a well-known writer. Frase et al. 
(1981) have also written about the ethics of imperfect 
measures, such as readability indices. Because of the 
limitations of imperfect measures, they recommend the use 
of multiple measures, the use of relative rather than absolute 
evaluations, and the treatment of imperfect measures as 
information rather than decisions. Noting the failure of 
Page's idea to grade essays by computer, they suggest that 
humans will never relinquish human judgment to imperfect 
measures and that this fact must be recognized by those who 
develop imperfect measures of writing skill. 
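
The kind of descriptive, nonevaluative output discussed above can be sketched in a few lines. The sketch is not the Writer's Workbench itself: the function names are hypothetical, the syllable counter is a crude vowel-group estimate, and the Flesch Reading Ease formula is used only as one familiar example of a readability index. It assumes a nonempty text.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: number of vowel groups, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def describe(text):
    """Return descriptive counts and a Flesch Reading Ease estimate for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return {
        "sentences": len(sentences),
        "words": len(words),
        "unique_words": len({w.lower() for w in words}),
        "words_per_sentence": words_per_sentence,
        # Flesch Reading Ease: 206.835 - 1.015 * (words per sentence)
        #                              - 84.6 * (syllables per word)
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
    }

sample = "The cat sat on the mat. The dog barked at the cat!"
report = describe(sample)
for name, value in report.items():
    print(name, value)
```

In keeping with the recommendations of Frase et al., output of this kind is best read as information about a text, and preferably in comparison with other texts, rather than as a decision about its quality.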

A much more sophisticated text-critiquing system, 
EPISTLE, is currently under development at IBM (Heidorn 
et al. 1982). The EPISTLE system is more sophisticated 
than the Writer's Workbench because it uses a parser that 
breaks down sentences into component parts of speech and 
relates the form, function, and syntax of each part. By 
contrast, the Writer's Workbench is only a collection of 
programs that identify characteristics of writing. A parse 
tree of a sentence can show, for example, that the distance 
between a subject and verb is too great. EPISTLE also 
performs paragraph-level critiques such as noting that there 
are too many passive sentences or too many compound or 
complex sentences. Heidorn et al. emphasize, however, that 
EPISTLE is still in the experimental stage. 

There are also computer writing assessment activities 
under way in the academic setting. One well-developed 
computer-assisted instructional program is JOURNALISM 
(Bishop 1974). JOURNALISM performs stylistic analysis 
by reporting variety in sentence length and overuse of 
articles, passives, adjectives, and adverbs. It also checks 
spelling and keeps students' records of progress. Another 
academic approach to the writing assessment problem is that 
of Finn (1977). Finn's approach is to focus only on word 
choices and to relate those to standard frequency counts to 
develop an index of writing maturity. Moe (1980) describes 
programs that count words and word strings of various 
types, analyze sentences, and estimate readability. 
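
The general idea of relating word choices to frequency counts can be illustrated as follows. This is not Finn's index: the frequency table is a toy, hypothetical stand-in for a published word-frequency list, and the threshold and function name are assumptions made only for the sketch.

```python
import re

# Hypothetical occurrences per million words; a real analysis would load a
# published frequency list in place of this toy table.
TOY_FREQ_PER_MILLION = {
    "the": 60000,
    "house": 500,
    "big": 400,
    "edifice": 2,
    "colossal": 3,
}

def rare_word_share(text, freq=TOY_FREQ_PER_MILLION, threshold=10):
    """Fraction of the listed words in a text whose frequency is below threshold."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    known = [w for w in words if w in freq]
    if not known:
        return 0.0
    return sum(freq[w] < threshold for w in known) / len(known)

print(rare_word_share("The big house"))
print(rare_word_share("The colossal edifice"))
```

A share of this kind describes vocabulary only. It says nothing about whether the less common words were well chosen, which is one reason such counts are better treated as description than as evaluation.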

A fundamental notion, that of an automated dictionary, 
was brought to the attention of the National Institute of 
Education in 1978, and later a conference was held (Miller 
1979). Since that time, software companies have developed 
automated dictionaries that function in consort with 
proofreading programs. These kinds of developments are 
likely to proceed rapidly if recent history is any guide. 



SUMMARY AND CONCLUSIONS 

The history of direct writing skill assessment is dominated 
by the issue of reliability. Specifically, the issue is the 
limited reliability of the usual kind of direct assessment in 
which an examinee produces a sample of writing on some 
topic during a limited time period, and that sample is then 
evaluated by one or more judges. As simple and straightforward 
as such procedures seem on the surface, the fact is that 
they are not simple at all. Much has been written about the 
inconsistency of the judgments of writing samples by 
English teachers and others. But there has been little 
examination of other kinds of limitations in the usual writing 
assessments. One important limitation not often examined 
has to do with the degree of content sampling usually 
conducted. 

The sampling domain for direct assessments reflects all 
possible types of stimuli (written, pictorial, aural, for 
example) and all possible modes of discourse (narrative, 
expressive, argumentative, for example). For each combination 
of stimulus and discourse mode, different contexts for 
writing occur. How much time is allowed for the writing? 
What reference materials, if any, are allowed? What is the 
purpose of the writing? Who is the audience? When one 
adds the context variables to the different stimulus types and 
the different modes of discourse, the domain from which 
any particular writing sample is drawn is extensive indeed. 
Because the usual writing sample represents only one kind 
of stimulus, only one mode of discourse, and only one 
context, it is a small sample of the possible domain of tasks 
that might be used to assess writing skill. Since some 
examinees are likely to perform better at some tasks than at 
others, the use of only a limited sample from the domain 
will result in errors in the assessment. These errors, in 
addition to the errors introduced by reader inconsistency, 
make reliable direct assessment difficult to attain. 

Reliabilities of essay assessments can be made 
acceptable, of course, through the use of expensive 
multiple-topic, multiple-mode, and multiple-reader 
procedures, as the evidence presented shows. Consequently, 
there is nothing inherently unreliable about the general 
approach. It is probably true, nevertheless, that student 
behavior in producing writing samples is less consistent 
than it is for more structured tests. There are more choices 
to make, more consequences of poor choices, and there is 
less control over the order of responses. As a result, it is 
difficult to attain very high reliabilities when these 
inconsistencies are coupled with those of readers making 
judgments of the samples. 
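
The trade-off between cost and reliability can be illustrated with the Spearman-Brown formula, a standard result of classical test theory that is introduced here as an assumption and not taken from the studies reviewed. It treats added readings or topics as parallel measurements, which real essay tasks only approximate, and it uses .50 purely as an illustrative reliability for a single score.

```python
def spearman_brown(single_reliability, k):
    """Projected reliability when k parallel measurements are combined."""
    r = single_reliability
    return k * r / (1 + (k - 1) * r)

# Illustrative starting value only: a single-score reliability of .50.
for k in (1, 2, 4, 8):
    print(k, round(spearman_brown(0.50, k), 2))
```

Under these assumptions each doubling of the number of measurements buys a smaller gain than the one before, which is the arithmetic behind the expense of multiple-topic, multiple-reader procedures.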

It has been effectively argued (Coffman 1966) that 
direct assessments of writing skill can be valid even if 
reliability is often a problem. To the degree that they relate 
to actual performance in English composition courses or to 
more extensive assessment of writing performance, direct 
assessments are valid measures. And substantial relationships 
with course performance have been reported. Moreover, 
direct assessments have been shown to contribute, 
incrementally, beyond the prediction possible using past 
academic performance and objective test scores. Therefore, 
it is difficult to argue that direct assessments of writing skill 
are not valid. Such validity could be increased, however, by 
improvements in the reliabilities of direct assessments. 
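
The dependence of validity on reliability can be stated numerically under classical test theory, which is assumed here and is not a result reported in the studies reviewed. The function names and the figures passed to them are hypothetical.

```python
import math

def validity_ceiling(reliability_x, reliability_y=1.0):
    """Upper bound on an observed correlation, given the two reliabilities."""
    return math.sqrt(reliability_x * reliability_y)

def disattenuated(r_xy, reliability_x, reliability_y):
    """Spearman's correction for attenuation: the correlation expected
    between the two measures if both were perfectly reliable."""
    return r_xy / math.sqrt(reliability_x * reliability_y)

# Illustrative values only: an essay score with reliability .50 and an
# observed correlation of .30 with a criterion of reliability .50.
print(round(validity_ceiling(0.50), 2))
print(round(disattenuated(0.30, 0.50, 0.50), 2))
```

On these assumptions, a modest observed correlation may reflect unreliable measurement on both sides as much as a weak underlying relationship.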

A validity issue for which no evidence was found is 
that related to the equating of essay assessments. Since 
topics and specific tasks vary in difficulty, and since each 
administration of a test must necessarily change the topic for 
security purposes, a not inconsequential problem is how 
best to equate a score received in one administration with a 
score received in another. This problem is usually handled 
through a combined essay and objective assessment in 
which the equating is performed on the combined score 
using an objective measure. However, if an essay assessment 
were used in isolation, it is not immediately apparent 
how equating across administrations could be achieved. 

An important but seldom examined validity issue is 
concerned with the purposes of testing. If the purpose is to 
rank students, then direct assessments with holistic scoring 
are clearly valid for that purpose. But if the interest is in 
specific strengths and weaknesses of a student's writing for 
use as instructional feedback, a holistically scored essay is 
at best a blunt instrument. Analytic scoring may not be 
much better when the writing samples obtained represent 
only a very small proportion of the domain of possible 
samples. Therefore, the validity of direct assessments for 
diagnostic and instructional purposes can easily be ques- 
tioned despite the obvious instructional utility of commen- 
tary on one's writing. 

Validity in direct assessments has also been questioned 
with respect to issues of test bias, but no evidence on this 
issue was available. A specific question is whether judges 
discriminate against minorities and others who speak 
dialects and other languages. Hoover and Politzer (1981), for 
example, observe that impressionistic judgments of the 
writing of speakers of dialects may be biased because the 
judge may react primarily to less important subskills (such 
as punctuation and grammar) and fail to note that other more 
important goals of the essay were achieved. The rating of 
subskills is suggested to minimize bias effects. Such 
subskill rating would increase the time required by judges. 

Recent analyses suggest technology offers some 
promise as a means of relieving the labor intensiveness of 
direct assessment. Implementation of technology is not 
without problems, however, because some procedure is first 
needed for entering the sample into the computer and 
because the types of analyses that can be performed by 
computer are limited. Obviously, word processing is not 
quite the same thing as composition. Nonetheless, some 
aspects of good and bad writing can probably be evaluated 
by an appropriately programmed word processor. 

One must conclude, first, that writing skill is 
inherently difficult to assess accurately. While direct 
assessment accuracy is limited by rater inconsistencies and 
domain sampling problems, the indirect assessment of 
writing skill has other limitations. A second conclusion that 
is unavoidable is that assessment is labor intensive, 
expensive, and cumbersome. A means has yet to be devised 
that significantly relieves this efficiency problem, though 
computers may represent a potential long-term solution. 
Faced with the present dilemma of either excessively high 
costs or low reliability, a solution is not easily found. 
Worse, some assessments in current use have high costs and 
low reliability. 

It may be that writing is simply too complex a skill to 
be measured completely. An approach that avoids some 
difficulties is to focus on specific support skills that are 
usually necessary but not sufficient for effective writing. 
Knowledge of the rules of syntax, lexical knowledge, and 
spelling of course come to mind. But it is probably also 
possible to assess better than we now do other more 
advanced skills like organizational skills, coherence skills, 
transition skills, and skills of revision and editing. 

Interestingly, it is this approach toward the assessment 
of specific skills that an English professor recently arrived at 
after facing some of the same problems described above. 
Matalene (1982) chronicles the experience of an English 
professor who became director of freshman English at a 
large state university. After struggling with the complex 
political issues surrounding an exit examination, she 
decided to develop a test of her own, with the assistance of 
English department faculty members. An early step was a 
survey of English professors and teaching assistants. 

With the survey as the basis, a revision and editing test 
was developed which consists of 30 items divided into two 
parts: the first 15 questions deal with units larger than a 
sentence, the last 15 questions are on how to improve 
sentences or groups of sentences. The test is printed with 
the entire essay on one page of the test booklet. Questions 
ask students to 

• discover thesis and topic sentences; 

• judge level of language, voice, coherence, logic, and 
diction; 

• discern methods of development, errors of logic, 
unstated assumptions, sentence variety, patterns of 
errors, effectiveness of examples; and 

• offer suggestions for revision. 

Following administrative and computer scoring of the 
test, each teacher receives a printout of his or her class 
which shows each student's answer to each question. After 
an extensive trial period, the test has now been made a 
requirement for completion of freshman English. 

What seems important in this example is that the test 
developed is not a test of writing skill per se but a revision 
and editing test, even though everyone (and certainly every 
English teacher) knows that there is more to writing than 
revising and editing. To the English professors and teaching 
assistants in this one university, however, these were the 
most important support skills. And a successful measure of 
these support skills was developed and is now in use. 

The example test described is not a direct assessment 
of writing skill, nor is it an indirect assessment of writing 
skill. It is a direct assessment of revision and editing skills. 
Similar direct assessments of other writing support skills 
would also seem to be possible. Direct assessments of 
written organization skills, of thesis statements, of methods 
of thesis development, and of the use of supporting evidence 
would also seem possible. Thus, the direct assessment of 
writing support skills represents one possible approach to 
the dilemma described earlier. 

Research into the assessment of these higher-level 
support skills is recommended. Also recommended is 
research of the following types: 

1. The development of a comprehensive criterion 
measure based on multiple writing samples written 
in different modes of discourse with each carefully 
evaluated by multiple judges on more than a single 
dimension. While in some senses similar to the 
Godshalk et al. (1966) study, an effort to develop a 
new writing criterion would benefit from more 
recent research on writing skill development. 
Furthermore, the new criterion could be used to 
evaluate new assessments of writing support skills. 
The collection of data on writing support skills 
using instruments such as those of Matalene (1982) 
and others and the analysis of such skills in relation 
to the overall variance in a comprehensive criterion 
would be especially useful. 

2. The conduct of confirmatory factor analyses as well 
as other kinds of analyses to examine the construct 
validity of the measures available as contrasted with 
the validity of new prototype measures. 

3. The analysis of judgmental assessments in conjunc- 
tion with automated assessments to determine in 
what ways these two approaches might be combined 
to optimize efficiency, reliability, and validity. 

4. The exploration of more efficient means for 
obtaining human judgments of written products. 
Such efficiency may be obtainable through the mail 
(particularly electronic mail) if appropriate quality 
control procedures are implemented at the same 
time. 

5. Since practicality usually dictates that only limited 
samples of an examinee's writing be taken, it would 
be important to examine what specific kinds of tasks 
elicit the most reliable and valid information. While 
persuasive/argumentative tasks may be preferred by 
English teachers, for example, they may be so 
difficult as to preclude much writing by many 
students. A comparative validity examination of 
task types would be valuable. 

6. Equating of direct assessments of writing is 
inherently difficult because tasks vary in difficulty. 
This problem is usually handled through the use of 
multiple-choice measures as anchors. If the examinee 
is allowed a choice of topics, the problem of 
equating is even more difficult. A useful investigation 
would explore equating issues as they relate to 
task types, choice of tasks by the examinee, and 
optimum methods for weighting direct and indirect 
components. 

7. Bias in judgments of essays may be influenced by 
methods used, as has been suggested by Hoover 
and Politzer (1982). An examination of holistic as 
opposed to analytic ratings for different dialect and 
linguistic groups would provide a better understanding 
of this issue. 